DSC 540 FINAL PROJECT

NAWAAZ SHARIF

MOHAMMED RASHIDUDDIN

SYED NOOR RAZI ALI

Dataset: Credit Card Transactions Fraud Detection Dataset

Link: https://www.kaggle.com/datasets/kartik2112/fraud-detection?select=fraudTrain.csv

To implement all the methods in a reasonable time, we had to reduce the dataset. We first tried running the models on the full dataset, but even after more than 15 hours, Random Forest alone had not finished running.

The dataset we are working with contains transactions from 1st Jan 2019 to 31st Dec 2020, some legitimate and some fraudulent. The columns in the dataset are:

  1. Transaction Date and Time.
  2. Credit Card Number.
  3. Merchant.
  4. Category.
  5. Amount.
  6. First Name.
  7. Last Name.
  8. Gender.
  9. Street Address.
  10. City.
  11. State.
  12. Zip.
  13. Latitude.
  14. Longitude.
  15. City population.
  16. Job.
  17. Date of Birth.
  18. Transaction number.
  19. Unix time.
  20. Merchant latitude.
  21. Merchant longitude.
  22. is_fraud

As technology improves, the number of people who fall victim to fraud or scams keeps increasing. By learning transaction patterns and unusual activity from historical data, we can help both individuals and organizations avoid becoming victims of such scams.

The dataset has a column "is_fraud", which our models will learn to predict from the training set. We then measure the effectiveness of each model on the test set, checking whether it correctly distinguishes fraudulent transactions from legitimate ones.

This dataset has already been split into train and test sets.

We select fewer rows where 'is_fraud' == 0, keep all the rows where 'is_fraud' == 1, and then merge them together before running the machine learning algorithms.

In the reduced training set, there are 32236 rows with 'is_fraud' == 0 and 7506 rows with 'is_fraud' == 1.
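The downsampling described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `downsample` and its parameters are hypothetical, and it assumes the data is loaded into a pandas DataFrame with an `is_fraud` column.

```python
import pandas as pd

def downsample(df, target="is_fraud", n_majority=32236, seed=42):
    """Keep every fraud row; sample a fixed number of non-fraud rows."""
    fraud = df[df[target] == 1]
    n = min(n_majority, int((df[target] == 0).sum()))
    legit = df[df[target] == 0].sample(n=n, random_state=seed)
    # Shuffle the merged frame so the two classes are interleaved
    return (pd.concat([legit, fraud])
              .sample(frac=1, random_state=seed)
              .reset_index(drop=True))
```

With the full training file this would keep all 7506 fraud rows and a random sample of the non-fraud rows.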

From the above dtypes, we can see that most of the variables are of object type.

DATA CLEANING

There are no null values in our dataset.

We will now make a subset of the features/columns that we will be using in our analysis or project.

EXPLORATORY DATA ANALYSIS

DATA VISUALIZATION

From the above histograms, the 'category' subplot shows that most of our transactions fall in the grocery_pos, shopping_net, and gas_transport categories. The 'gender' subplot shows that more transactions are made by females.

The top categories of places where transactions took place are:

1. grocery_pos
2. shopping_net
3. gas_transport
4. shopping_pos
5. home
6. kids_pets
7. misc_net
8. entertainment
9. personal_care
10. food_dining
11. health_fitness
12. misc_pos
13. grocery_net
14. travel

From the above pie chart, we can see the percentage of transactions that each category accounts for.

From the above correlation map, we can infer that:

  1. amt and unix_time have a good correlation of 0.546401
  2. amt and is_fraud are strongly positively correlated (0.655558)
  3. zip and long are strongly negatively correlated (-0.908855)
  4. lat and merch_lat are strongly positively correlated (0.993619)
  5. unix_time and is_fraud are strongly positively correlated (0.821524)

DATA PREPARATION

Getting the dummy variables for the columns 'category' and 'gender'.
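A minimal sketch of the dummy-variable step with `pandas.get_dummies`, using a hypothetical mini-frame (the category and gender values mirror the report; the project's real frame has many more columns):

```python
import pandas as pd

# Hypothetical mini-frame with the two categorical columns from the report
df = pd.DataFrame({
    "category": ["grocery_pos", "shopping_net", "grocery_pos"],
    "gender": ["F", "M", "F"],
    "amt": [12.5, 80.0, 3.2],
})

# One-hot encode; drop_first avoids the dummy-variable trap
encoded = pd.get_dummies(df, columns=["category", "gender"], drop_first=True)
```

After encoding, `category` and `gender` are replaced by indicator columns such as `category_shopping_net` and `gender_M`.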

Importing the test set

Again selecting fewer rows where 'is_fraud' == 0, keeping all the rows where 'is_fraud' == 1, and then merging them together before running the machine learning algorithms.

From the above pie chart, we can see that the highest numbers of transactions with is_fraud == 1 occur in the health_fitness, misc_net, and entertainment categories.

IMPLEMENTING DIFFERENT MACHINE LEARNING APPLICATIONS ON THE DATASET

1. LOGISTIC REGRESSION

The accuracy on the training set using logistic regression is 0.938

The accuracy on the testing set using logistic regression is 0.963
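The train/test accuracy pattern above can be reproduced in miniature with scikit-learn. This is a sketch on synthetic data standing in for the encoded transaction features; none of it is the project's actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded transaction features
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)  # accuracy on the training set
test_acc = clf.score(X_te, y_te)   # accuracy on the held-out test set
```

The same `fit` / `score` pattern applies to every classifier reported in this section.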

2. K-Nearest Neighbors Classifier

The accuracy on the training set using the K-Nearest Neighbors classifier is 0.995

The accuracy on the testing set using the K-Nearest Neighbors classifier is 0.959

3. Decision Tree Classifier

The accuracy on the training set using the decision tree is 0.995

The accuracy on the testing set using the decision tree is 0.968

From the above confusion matrix, we can infer that around 180 observations were labeled incorrectly as fraud or not fraud.

From the above confusion matrix, we can infer that around 750 observations were labeled incorrectly as fraud or not fraud.
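The "observations labeled incorrectly" counts quoted for each model are the off-diagonal entries of the confusion matrix. A small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# Off-diagonal cells are the mislabeled observations
misclassified = cm.sum() - np.trace(cm)
```

Here `misclassified` is 2: one false positive and one false negative.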

4. Random Forest

The accuracy on the training set using random forest is 0.995

The accuracy on the testing set using random forest is 0.968

From the above confusion matrix, we can infer that around 180 observations were labeled incorrectly as fraud or not fraud.

From the above confusion matrix, we can infer that around 750 observations were labeled incorrectly as fraud or not fraud.

5. ADABOOST

The accuracy on the training set using AdaBoost is 0.949

The accuracy on the testing set using AdaBoost is 0.967

From the above confusion matrix, we can infer that around 2000 observations were labeled incorrectly as fraud or not fraud.

From the above confusion matrix, we can infer that around 780 observations were labeled incorrectly as fraud or not fraud.

6. GRADIENT DESCENT

The accuracy on the training set using gradient descent is 0.967

The accuracy on the testing set using gradient descent is 0.978
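The report does not show which estimator "gradient descent" refers to; a common choice in scikit-learn is `SGDClassifier`, a linear classifier fitted by stochastic gradient descent. A minimal sketch under that assumption, on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for the encoded transaction features
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Feature scaling matters for SGD-based models
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=1))
model.fit(X, y)
acc = model.score(X, y)  # training-set accuracy
```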

From the above confusion matrix, we can infer that around 1300 observations were labeled incorrectly as fraud or not fraud.

From the above confusion matrix, we can infer that around 510 observations were labeled incorrectly as fraud or not fraud.

7. Linear Regression

Comparing the RMSEs of the training and testing sets, since there is not much difference between the two, we can say the model is working properly.
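The RMSE comparison used throughout the regression sections can be sketched as follows, again on synthetic stand-in data rather than the project's features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
# Synthetic regression data with a small amount of noise
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te = X[:200], X[200:]
y_tr, y_te = y[:200], y[200:]

reg = LinearRegression().fit(X_tr, y_tr)
# RMSE is the square root of the mean squared error
rmse_tr = mean_squared_error(y_tr, reg.predict(X_tr)) ** 0.5
rmse_te = mean_squared_error(y_te, reg.predict(X_te)) ** 0.5
```

A small gap between `rmse_tr` and `rmse_te` is what the report treats as the model "working properly".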

8. Random Forest Regressor

Comparing the RMSEs of the training and testing sets, since the training-set RMSE is only slightly lower than the test-set RMSE, we can say the model is not overfitting.

Changing some hyperparameters of the Random Forest model.

Adding max_depth parameter

Comparing the RMSEs of the training and testing sets, since there is not much difference between the two, we can say the model is working properly.
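The `max_depth` tweak mentioned above caps how deep each tree can grow, which regularises the forest. A sketch on synthetic data (the parameter values here are illustrative, not the project's actual settings):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# Shallow trees: less capacity, smaller train/test RMSE gap
shallow = RandomForestRegressor(n_estimators=50, max_depth=4,
                                random_state=0).fit(X, y)
# Unlimited depth: trees grow until leaves are (near) pure
deep = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
```

The unrestricted forest fits the training data more closely; limiting `max_depth` trades some training fit for better generalisation.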

9. SVR

Linear SVM Regressor

Comparing the RMSEs of the training and testing sets, since there is not much difference between the two, we can say the model is working properly.

Polynomial Kernel SVM Regressor

Comparing the RMSEs of the training and testing sets, since there is not much difference between the two, we can say the model is working properly.

Also, since the R² values are negative, the model did not work properly on this dataset. Although a polynomial-kernel SVM can be applied to regression as well as classification tasks, it clearly did not fit this dataset well.

RBF Kernel SVM Regressor

Comparing the RMSEs of the training and testing sets, since there is not much difference between the two, we can say the model is working properly.
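The three SVR variants above differ only in the `kernel` argument. A minimal sketch fitting all three on a toy one-dimensional target (scaling first, since SVMs are scale-sensitive):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(3)
# Toy 1-D regression target
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X[:, 0])

models = {
    "linear": make_pipeline(StandardScaler(), SVR(kernel="linear")),
    "poly": make_pipeline(StandardScaler(), SVR(kernel="poly", degree=3)),
    "rbf": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
}
# R^2 on the training data for each kernel
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```

On a nonlinear target like this, the RBF kernel typically fits best, which is consistent with the report's observation that the polynomial kernel underperformed.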

10. PCA

From the above graph, we can see that five principal components (PCs) explain 90% of the variance in the dataset, so we will use five PCs for our evaluation.
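The "smallest number of PCs explaining 90% of the variance" can be read off the cumulative explained-variance ratio. A sketch on synthetic correlated data standing in for the project's features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Synthetic stand-in: correlated features built by mixing random directions
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches 90%
n_components = int(np.argmax(cum >= 0.90)) + 1
```

Plotting `cum` against the component index produces the elbow-style graph referred to above.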

Linear Regression with PCA transformed data

Comparing the RMSEs of the training and testing sets, since there is not much difference between the two, we can say the model is working properly.

RF Regression with PCA transformed data

Comparing the RMSEs of the training and testing sets, since the test-set RMSE is higher than the training-set RMSE, we can say the model is slightly overfitting (although the difference between the two RMSEs is not large).

SVM Regression with PCA transformed data

Comparing the RMSEs of the training and testing sets, since the training-set RMSE is somewhat higher than the test-set RMSE, we can say the model is underfitting.

From the ROC curves above, we can compare the performance of the different models. The models that performed well (AUC in parentheses) are Decision Tree (0.94), Gradient Descent (0.94), KNN (0.92), AdaBoost (0.88), Random Forest Classifier (0.85), and Logistic Regression (0.85).
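The per-model numbers quoted above are areas under the ROC curve (AUC). A minimal sketch of computing AUC from predicted scores, using tiny hypothetical labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and model scores for four transactions
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# AUC = probability a random fraud case scores above a random legit case
auc = roc_auc_score(y_true, scores)
```

Here `auc` is 0.75: of the four fraud/legit pairs, three are ranked correctly.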

Out of all the models, we would choose Decision Tree (0.94), Gradient Descent (0.94), or KNN (0.92), as they have the highest and very similar AUC values.

If we had extra time, we would have run our models on the whole dataset (1.5 million rows), but as mentioned at the start, a single algorithm (Random Forest) on the full dataset ran for more than 15 hours without finishing.
Also, given extra time, there are many similar fraud-detection datasets we could have used as additional test sets, to check whether the model still performs the way it does now.